Air Quality Indicators and Disease Prevalence Across the US

Final Project
Data Science 1 with R (STAT 301-1)

Author

Cassie Lee

Published

November 28, 2023

Introduction

Air pollution is the presence of sufficient quantities of contaminants in the atmosphere for a duration that is long enough to cause harm to human health.1 Air pollution mainly enters the body through the lungs, and while it mostly impacts the heart, lungs, and brain, it has the potential to affect other organs as well by traveling through the bloodstream.2

In the EDA, I explore the relationships between air quality indicators, certain lung diseases, and how sociodemographic vulnerabilities affect the relationships between air quality and health in the United States (excluding territories). The goal of this analysis is to build a deeper understanding of how various air pollutants impact health across the United States.

I queried the data at the county level and downloaded it from the CDC National Environmental Public Health Tracking Network interactive data explorer.3 This network was built to centralize environmental health data at the national, state, and county levels across the United States.4

The three central questions I explored in this analysis were:

  1. How are various air and environmental quality indicators related to each other?

  2. What are the most important indicators for determining the prevalence of asthma, cancer, and chronic obstructive pulmonary disease?

  3. How do sociodemographics like age, gender, race, and the social vulnerability index affect the relationships between lung disease and air quality?

Data overview & quality

The air pollutants I selected include the days over the ozone standard, the percent of days over the PM 2.5 standard, benzene, formaldehyde, acetaldehyde, carbon tetrachloride, and 1,3-butadiene pollution. The environmental quality indicators I selected include the percent of people living near highways, the percent of public schools near highways, access to parks, and methods of transportation to work (walking, biking, driving alone, carpooling, public transportation, and none).

The indicators I selected for prevalence of lung diseases include the crude prevalence of adult and child asthma, the crude and age adjusted rates of emergency department visits for asthma, age adjusted rates of lung and bronchus cancer, and the crude and age adjusted rates of chronic obstructive pulmonary disease.

The indicators of sociodemographic data I selected include age, gender, race, and the social vulnerability index.5 I compared sociodemographic characteristics of each county to other counties in the United States and used this to identify counties with comparatively higher percentages of vulnerable groups (Appendix A). I also identified the majority race demographic in each county.
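The exact rule for flagging vulnerable counties is described in Appendix A and is not reproduced here. As a rough sketch only, the Python snippet below assumes a county counts as vulnerable when its share of a group exceeds the 75th percentile across counties, and it picks the predominant race as the group with the largest share. The cutoff, the function names, and the toy numbers are illustrative assumptions, not the values used in this analysis.

```python
import statistics

def flag_vulnerable(shares_by_county):
    """Flag counties whose share exceeds the 75th percentile (assumed cutoff)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    cutoff = statistics.quantiles(shares_by_county.values(), n=4)[2]
    return {county: share > cutoff for county, share in shares_by_county.items()}

def predominant_race(race_shares):
    """Return the group with the largest population share in one county."""
    return max(race_shares, key=race_shares.get)

# Made-up percentages of residents aged 65+ in five hypothetical counties.
pct_over_65 = {"A": 10.0, "B": 12.0, "C": 14.0, "D": 16.0, "E": 30.0}
flags = flag_vulnerable(pct_over_65)

# Made-up race shares for one hypothetical county.
majority = predominant_race({"white": 0.55, "Black": 0.35, "other": 0.10})
```

A percentile cutoff keeps the flag relative to other counties, which matches the comparison described above; a fixed threshold would be an equally plausible alternative.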

If available, I downloaded all data at the county level for 2018. Crude rates of child asthma and age adjusted rates of lung and bronchus cancer were only available at the state level. I downloaded data for the usual method of transportation to work for the time period of 2017 to 2021.

The final merged dataset includes 42 variables and 3144 observations matching each county and Washington DC. There are 4 identifying variables, 5 factor variables, 32 numeric variables, and one simple features geometry variable for mapping.

With the exception of Rio Arriba County, New Mexico, all counties have complete sociodemographic information. Rio Arriba County is missing data for the social vulnerability index. The prevalence of adult asthma is complete across all observations, but 1402 observations are missing for the prevalence of child asthma and 1789 observations are missing for crude and age adjusted emergency department visits for asthma. The majority of counties do not have information about the days over the ozone and PM 2.5 air quality standards. 1567 observations are missing data for the percent of public schools near highways. All other variables are either complete or missing at most 1 observation.
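A missingness summary like the one above can be produced by counting empty values per variable. The following Python sketch shows the idea on three made-up rows; the column names and values are hypothetical stand-ins, not the merged dataset.

```python
def count_missing(rows, columns):
    """Return {column: number of rows where the value is missing (None)}."""
    return {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in columns
    }

# Toy rows standing in for counties; all values are invented for illustration.
counties = [
    {"county": "A", "adult_asthma": 9.1, "child_asthma": None, "ozone_days": 3},
    {"county": "B", "adult_asthma": 8.7, "child_asthma": 7.9, "ozone_days": None},
    {"county": "C", "adult_asthma": 9.8, "child_asthma": None, "ozone_days": None},
]

missing = count_missing(counties, ["adult_asthma", "child_asthma", "ozone_days"])
```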

Explorations

Indicators of air and environmental quality

To address the first central question, I used univariate analysis to understand the distribution of the indicators of air and environmental quality. I also used bivariate analysis to understand if and how certain indicators were related to each other.

Air quality: ozone and PM 2.5

As seen in Figure 1, the distributions of days over the ozone and particulate matter size 2.5 microns (PM 2.5) standards are both skewed right by a small number of counties with far more days over the standard. For both ozone and PM 2.5, most counties experienced no more than 20 days over the standard. However, the distribution of days over the ozone standard is noticeably more right-skewed than the distribution of days over the PM 2.5 standard.

Figure 1: Distribution of days over air quality standards.
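The degree of skew can also be put on a numeric scale. As a minimal sketch, the Python snippet below computes the moment-based skewness coefficient (third central moment divided by the second central moment raised to the power 1.5), where positive values indicate a right skew. The day counts are invented for illustration, and the function assumes the values are not all identical.

```python
def sample_skewness(values):
    """Moment-based skewness: positive means right-skewed, negative left-skewed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Invented day counts: most counties are low, one county is very high.
ozone_days = [0, 0, 1, 2, 2, 3, 5, 60]
ozone_skew = sample_skewness(ozone_days)

# A symmetric toy sample for comparison.
symmetric_skew = sample_skewness([1, 2, 3])
```

Comparing this coefficient for the ozone and PM 2.5 variables would give a single number to support the visual comparison of the two histograms.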

Given that the distributions are so heavily skewed right, I was interested to see how the days over the ozone and PM 2.5 standards are distributed geographically across the US. Although most of the data is missing, Figure 2 shows that Southern California has high levels of ozone pollution and Central California has high levels of PM 2.5 pollution. For PM 2.5, the pattern of days over the standard aligns with the California wildfire incident map for 2018.6

Figure 2: Distribution of days over the ozone and PM 2.5 standard across the United States, excluding Hawaii and Alaska.

Figure 3 shows the relationship between the days over the ozone and PM 2.5 standard. There is a positive relationship between the two air quality indicators, indicating that counties with worse ozone pollution generally also have worse PM 2.5 pollution. This is consistent with research about the sources of ozone and PM 2.5 pollution, as they can both originate from nitrogen oxides from power plants, industrial pollution, and automobiles.7

However, a large number of counties report 0 days over the PM 2.5 standard while having several days over the ozone standard, and several counties report 0 days over the ozone standard while having several days over the PM 2.5 standard. This is consistent with research showing that some sources produce one pollutant but not the other. For example, construction sites, unpaved roads, fields, smokestacks, and fires produce PM 2.5 pollution, but not ozone pollution.8

Figure 3: Relationship between days over the PM 2.5 standard and days over the ozone standard excluding outliers (over 50 days over one or both standards).
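The outlier rule in the Figure 3 caption amounts to a simple row filter. The Python sketch below keeps only counties with both values present and at most 50 days over each standard; the field names and rows are hypothetical.

```python
def drop_outliers(rows, limit=50):
    """Keep rows with both day counts present and neither above `limit`."""
    kept = []
    for row in rows:
        ozone, pm25 = row["ozone_days"], row["pm25_days"]
        if ozone is None or pm25 is None:
            continue  # cannot be plotted without both values
        if ozone <= limit and pm25 <= limit:
            kept.append(row)
    return kept

# Invented rows for illustration only.
rows = [
    {"county": "A", "ozone_days": 4, "pm25_days": 0},
    {"county": "B", "ozone_days": 87, "pm25_days": 12},   # over 50 on ozone
    {"county": "C", "ozone_days": 0, "pm25_days": 51},    # over 50 on PM 2.5
    {"county": "D", "ozone_days": None, "pm25_days": 2},  # missing ozone
    {"county": "E", "ozone_days": 50, "pm25_days": 50},   # exactly at the limit
]
plotted = [row["county"] for row in drop_outliers(rows)]
```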

Air quality: benzene, formaldehyde, acetaldehyde, carbon tetrachloride, and 1,3-butadiene

The other group of air quality indicators I was interested in was benzene, formaldehyde, acetaldehyde, carbon tetrachloride, and 1,3-butadiene concentrations. Figure 4 shows the distribution of these air pollutants. Benzene, formaldehyde, acetaldehyde, and 1,3-butadiene are skewed right, indicating that there are several counties that have unusually high levels of these pollutants. This is expected, as counties with unusually high industrial pollution would cause this distribution. However, carbon tetrachloride is skewed left, indicating that there are several counties that have unusually low levels of this pollutant. One reason that this distribution may differ from the others is that carbon tetrachloride is not naturally occurring, while the other pollutants are. Thus counties with unusually low levels of carbon tetrachloride may be counties that have never had high levels of carbon tetrachloride exposure, and thus are capable of having extremely low values.9

Figure 4: Distribution of five air pollutants.

Then, I explored how these five air pollutants were correlated with each other. Figure 5 shows a correlation matrix of these air pollutants. Formaldehyde and acetaldehyde are highly correlated with each other, while the rest of the pollutants are somewhat or barely correlated with each other. Carbon tetrachloride and benzene are somewhat correlated, and 1,3-butadiene is somewhat correlated with both formaldehyde and acetaldehyde. Benzene is barely correlated with formaldehyde and acetaldehyde. Correlation between pollutants suggests shared sources that emit multiple pollutants at the same time.

Figure 5: Correlation matrix of five air pollutants.
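For readers who want to see what goes into a correlation matrix like Figure 5, here is a minimal Python sketch. It assumes Pearson correlation computed on pairwise complete observations (rows where both variables are present); the report does not state which correlation method or missing-data rule was used, and the concentrations below are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation using only positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Invented concentrations for five hypothetical counties.
pollutants = {
    "formaldehyde": [1.0, 1.5, 2.0, 2.5, 3.0],
    "acetaldehyde": [0.9, 1.4, 2.1, 2.4, 3.1],
    "benzene": [0.6, 0.2, None, 0.5, 0.3],
}
matrix = correlation_matrix(pollutants)
```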

Air quality: combined

After exploring how these five pollutants were correlated with each other, I wanted to see if they were correlated with ozone and PM 2.5 pollution. Figure 6 shows that ozone and PM 2.5 pollution are not particularly correlated with the other five pollutants. This is likely because the five chemicals come almost strictly from industrial pollution, while ozone and PM 2.5 pollution can have significant non-industrial sources, such as automobiles.

Figure 6: Correlation matrix of all air quality indicators.

Environmental quality

After bivariate analysis of air quality indicators against environmental quality indicators, I decided not to move forward with the environmental indicators because they were not particularly predictive of air quality. For example, I had originally suspected that the percent of the population living near a highway would be predictive of ozone levels; however, Figure 7 shows that it is not. This lack of relationship between environmental quality and air quality suggested that continuing to explore indicators of environmental quality would not help me answer my three main questions.

Figure 7: The relationship between the days over the ozone standard and the percent of people living within 150 m of a highway is an example of how environmental quality indicators were not particularly predictive of air quality indicators.

Lung disease and air quality indicators

Once I had an understanding of how the air quality indicators were related to each other, I was interested in seeing how air pollution and various lung diseases were related.

Asthma

Figure 8 shows that although ozone and PM 2.5 pollution are known to aggravate lung diseases such as asthma, there is no particularly clear relationship between emergency department visits for asthma and ozone or PM 2.5 pollution.10 It is possible that given better predictions and access to air quality information online, individuals with asthma are better able to avoid long exposures to high ozone and PM 2.5 levels, allowing them to avoid aggravating their asthma.

Figure 8: The crude rate of emergency department visits for asthma per 10 K population as a function of days over the ozone and PM 2.5 standard.

However, Figure 9 shows a clear positive relationship between the prevalence of asthma and exposure to the pollutants formaldehyde and acetaldehyde for both adults and children. Since the relationship between asthma and these two pollutants holds in both childhood and adulthood, I suspect that exposure to these pollutants in childhood may be associated with the development of asthma, which is then carried into adulthood. The relationship between childhood asthma and formaldehyde exposure has been supported by various studies.11 There is limited and conflicting evidence about the long term effects of acetaldehyde exposure, so the positive relationship between childhood asthma and acetaldehyde exposure in these graphs may just be a result of the extremely strong correlation between formaldehyde and acetaldehyde.

Figure 9: The prevalence of adult and child asthma (percent of population) as a function of formaldehyde and acetaldehyde concentrations.

Lung and bronchus cancer

To explore how lung and bronchus cancer was associated with the air quality indicators, I used a correlation matrix to identify potentially interesting relationships to explore. Figure 10 shows that the days over the ozone and PM 2.5 standards were not positively correlated with cancer, but it is important to note that there was a lot of missing data for these two indicators. On the other hand, there are relatively strong positive correlations between lung and bronchus cancer and the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride.

Figure 10: Correlation matrix of lung and bronchus cancer and air quality indicators.

Figure 11 visualizes the relationship between the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride and the prevalence of lung and bronchus cancer. As expected, formaldehyde and acetaldehyde have very similar relationships with the prevalence of lung and bronchus cancer. However, it is surprising that the relationship between lung and bronchus cancer and carbon tetrachloride has such a high correlation and a relatively high slope, because this pollutant primarily affects the liver, kidneys, and central nervous system.12 Its main carcinogenic properties affect the liver, not the lungs.13 Figure 10 shows that carbon tetrachloride is most strongly correlated with benzene; however, benzene is not strongly correlated with cancer risk. Thus, carbon tetrachloride is likely correlated with a different carcinogenic air pollutant that affects the respiratory system and was not explored here.

Figure 11: The prevalence of lung and bronchus cancer per 100 K population as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride.

Chronic obstructive pulmonary disease

Finally, to explore the relationship between chronic obstructive pulmonary disease (COPD) and air quality, I created another correlation matrix to identify potentially interesting relationships. Figure 12 and Figure 13 show that, as with lung and bronchus cancer, formaldehyde, acetaldehyde, and carbon tetrachloride pollution were strongly correlated with COPD. However, unlike the relationships for lung and bronchus cancer, formaldehyde and acetaldehyde have higher slopes. This is consistent with studies showing that formaldehyde exposure through inhalation increases the risk of COPD.14 The relationship between COPD and acetaldehyde is likely just a result of the strong correlation between formaldehyde and acetaldehyde because acetaldehyde is not known to have chronic health effects.

Figure 12: Correlation matrix of COPD and air quality indicators.

Figure 13: Age adjusted percentage of COPD as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride.

Sociodemographic effects on lung disease and air quality

Demographic age vulnerability

Given that poor air quality generally affects the young and the old, I explored how the distribution of age affected the relationship between lung disease and air quality. I identified counties with a relatively high proportion of young people or a relatively high proportion of older people as vulnerable.

Although Figure 8 did not show a clear relationship between emergency department visits and days over the ozone and PM 2.5 standards, Figure 14 highlights how counties with a high population of young or old individuals do in fact see an increase in emergency department visits for asthma given poor air quality. For counties that are not vulnerable by age demographics, emergency department visits still do not have a clear relationship with ozone. However, for PM 2.5 pollution, emergency department visits decrease with an increasing number of days over the PM 2.5 standard. This may still reflect the tendency of individuals to use air quality forecasts to limit exposure.

Figure 14: The crude rate of emergency department visits for asthma per 10 K population as a function of days over the ozone and PM 2.5 standard, disaggregated by demographic age vulnerability.
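The comparison in Figure 14 comes down to fitting a separate trend for each group and comparing the slopes. The Python sketch below assumes a simple least-squares line per group, which may differ from the smoothing actually used in the figures. The field names and numbers are invented so that the two groups have opposite slopes.

```python
from collections import defaultdict

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def slopes_by_group(rows, x_key, y_key, group_key):
    """Fit one slope per group, skipping rows with a missing x or y value."""
    groups = defaultdict(list)
    for row in rows:
        if row[x_key] is not None and row[y_key] is not None:
            groups[row[group_key]].append((row[x_key], row[y_key]))
    return {
        group: ols_slope([p[0] for p in points], [p[1] for p in points])
        for group, points in groups.items()
    }

# Invented counties: ED visits per 10 K versus days over the PM 2.5 standard.
rows = [
    {"pm25_days": 0, "ed_visits": 40.0, "age_vulnerable": True},
    {"pm25_days": 10, "ed_visits": 45.0, "age_vulnerable": True},
    {"pm25_days": 20, "ed_visits": 50.0, "age_vulnerable": True},
    {"pm25_days": 0, "ed_visits": 50.0, "age_vulnerable": False},
    {"pm25_days": 10, "ed_visits": 48.0, "age_vulnerable": False},
    {"pm25_days": 20, "ed_visits": 46.0, "age_vulnerable": False},
    {"pm25_days": None, "ed_visits": 47.0, "age_vulnerable": False},
]
slopes = slopes_by_group(rows, "pm25_days", "ed_visits", "age_vulnerable")
```

Opposite signs across groups, as in this toy data, are exactly the kind of pattern that a pooled fit like Figure 8 can hide.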

I was also interested in exploring how vulnerability by age demographics would affect the relationship between the prevalence of cancer and air pollutants. Figure 15 shows how the relationship between lung and bronchus cancer and air pollutants is the same across age vulnerabilities. However, counties that are vulnerable by age demographics generally have a lower prevalence of cancer. This is likely because children typically do not have enough time to develop lung and bronchus cancer at a young age, and people who have had lung and bronchus cancer may not live until older ages, so they would not be included in the population statistics.

Figure 15: The prevalence of lung and bronchus cancer per 100 K population as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride, disaggregated by demographic age vulnerability.

Figure 16 shows that demographic age vulnerability had a similar effect on the relationship between COPD and air pollutants. However, it seems that for counties that did not have a high population of young and old individuals, the effect of formaldehyde and acetaldehyde pollution on COPD prevalence is greater. This is likely another effect of how chronic diseases develop and affect age distributions.

Figure 16: Age adjusted percentage of COPD as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride, disaggregated by demographic age vulnerability.

Gender vulnerability

Given that there are often differences in exposure to environmental hazards between men and women,15 I was interested to see if counties with a relatively higher proportion of women had a different relationship with asthma and air quality than counties with a relatively lower proportion of women. Figure 17 shows that in general, the effect of poor air quality on emergency department visits for asthma was larger for counties with a relatively higher proportion of women. The exception to this is PM 2.5, and this could be due to gendered differences in risk perception for poor air quality.16

Figure 17: The crude rate of emergency department visits for asthma per 10 K population as a function of days over the ozone and PM 2.5 standard and the pollutants formaldehyde and acetaldehyde, disaggregated by gender vulnerability.

For both lung and bronchus cancer and COPD, the effect of gender demographics on the relationship between air quality and disease is generally the opposite of that for asthma. Figure 18 shows that increased formaldehyde and acetaldehyde concentrations have a larger effect on lung and bronchus cancer and COPD for counties with a relatively high proportion of men (lower proportion of women). With the exception of carbon tetrachloride, the gendered differences in the development of chronic lung diseases in response to air pollution may reflect gendered differences in exposure duration, possibly through occupation.17

Figure 18: The prevalence of lung and bronchus cancer per 100 K population and the age adjusted percentage of COPD as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride, disaggregated by gender vulnerability.

Race

I was interested in how the predominant race in each county affected the relationship between lung disease and air quality. However, so few counties are predominantly American Indian/Alaska Native or Asian/Pacific Islander that comparisons can only be made between predominantly Black and predominantly white counties. Even then, limited variability within each predominant-race group greatly limited the ability to draw conclusions about the effect of predominant race on the relationship between lung disease and air quality.

Figure 19 shows that predominantly Black counties have higher levels of both formaldehyde and acetaldehyde pollution and a higher prevalence of asthma; however, the effect of increased air pollution on the prevalence of asthma is either equal to or less than in predominantly white counties. This may be a limitation of lower variation in pollution levels in counties that are predominantly Black.

Figure 19: The prevalence of adult and child asthma (percent of population) as a function of formaldehyde and acetaldehyde concentrations, disaggregated by predominant race.

Unexpectedly, Figure 20 shows that in predominantly Black counties, as the level of formaldehyde and acetaldehyde pollution increases, the prevalence of lung and bronchus cancer decreases. Even though the confidence interval for predominantly Black counties is fairly large, it still clearly shows a different relationship than for predominantly white counties. For carbon tetrachloride, however, all predominant races have very similar effects on the relationship between cancer and pollution, just with varying levels of cancer prevalence overall.

Figure 20: The prevalence of lung and bronchus cancer per 100 K population as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride, disaggregated by predominant race.

Figure 21 shows that between predominantly Black and white counties, the predominant race does not particularly affect the relationship between COPD and the pollutants formaldehyde and acetaldehyde. The relationship between COPD and carbon tetrachloride in predominantly Black counties is nearly opposite of the relationship for predominantly white counties, but I suspect this is due to the limited variability in the level of carbon tetrachloride pollution measured in predominantly Black counties.

Figure 21: Age adjusted percentage of COPD as a function of the pollutants formaldehyde, acetaldehyde, and carbon tetrachloride, disaggregated by predominant race.

Social Vulnerability Index

Initially, I was interested in exploring how the social vulnerability index (SVI) of each county affected the relationship between lung disease and air quality. However, upon exploration, there was little difference in the relationship between lung disease and air quality for counties with a high SVI score compared to a low SVI score. Figure 22 shows that although counties with high social vulnerability have a slight negative slope for the prevalence of asthma and an overall higher prevalence, the relationship between increased air pollution and the prevalence of asthma is not much different than for counties with low social vulnerability. Similarly, although the prevalence of COPD is higher for counties with high social vulnerability, the slope is not much different from that of counties with low social vulnerability. For lung and bronchus cancer, social vulnerability has a negligible effect on the relationship between the prevalence of cancer and formaldehyde pollution.

I suspect that part of the reason SVI does not greatly affect the relationship between lung disease and air pollution is that SVI was designed to capture the social vulnerability in responding to emergency events such as hurricanes, disease outbreaks, or exposure to dangerous chemicals, whereas my explorations focus more on the chronic effects of long term air pollution.18

Figure 22: The prevalence of adult asthma (percent of population), lung and bronchus cancer per 100 K population, and COPD (age adjusted percentage of population) as a function of formaldehyde pollution, disaggregated by SVI.

Conclusions

Sources

Appendix

A. Identifying sociodemographic vulnerabilities